Approximately Minwise Independence with Twisted Tabulation
A random hash function h is ε-minwise if for any set S, |S| = n,
and element x ∈ S, Pr[h(x) = min h(S)] = (1 ± ε)/n.
Minwise hash functions with low bias have widespread applications
within similarity estimation.
Hashing from a universe [u], the twisted tabulation hashing of
Pătraşcu and Thorup [SODA'13] makes c = O(1) lookups in tables of size
u^{1/c}. Twisted tabulation was invented to get good concentration for
hashing-based sampling. Here we show that twisted tabulation yields Õ(1/u^{1/c})-minwise hashing.
In the classic independence paradigm of Wegman and Carter [FOCS'79], Õ(1/u^{1/c})-minwise hashing requires Ω(log u)-independence [Indyk
SODA'99]. Pătraşcu and Thorup [STOC'11] had shown that simple
tabulation, using the same space and lookups, yields Õ(1/n^{1/c})-minwise
independence, which is good for large sets but useless for small sets. Our
analysis uses some of the same methods but is much cleaner, bypassing a
complicated induction argument.
Comment: To appear in Proceedings of SWAT 201
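As a rough illustration of how minwise hashing feeds similarity estimation, the sketch below estimates Jaccard similarity from min-hash signatures. It uses simple tabulation (the scheme of the STOC'11 result mentioned above), not twisted tabulation, and every name in it (`SimpleTabulation`, `minhash_signature`, `estimate_jaccard`), the 32-bit key width, the four 8-bit characters and the 200 hash functions are assumptions made for the example, not details from the paper.

```python
import random


class SimpleTabulation:
    """Simple tabulation hash for 32-bit keys (illustrative, c = 4 characters)."""

    def __init__(self, rng):
        # One table of 256 random 32-bit values per 8-bit character of the key.
        self.tables = [[rng.getrandbits(32) for _ in range(256)] for _ in range(4)]

    def __call__(self, key):
        h = 0
        for i, table in enumerate(self.tables):
            # Look up the i-th byte of the key and XOR the table entries together.
            h ^= table[(key >> (8 * i)) & 0xFF]
        return h


def minhash_signature(items, hashes):
    """For each hash function, keep the minimum hash value over the set."""
    return [min(h(x) for x in items) for h in hashes]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of hash functions whose minima agree; approximates |A∩B| / |A∪B|."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
hashes = [SimpleTabulation(rng) for _ in range(200)]
A = set(range(0, 150))
B = set(range(50, 200))  # true Jaccard similarity of A and B is 100/200 = 0.5
print(estimate_jaccard(minhash_signature(A, hashes), minhash_signature(B, hashes)))
```

The estimate is only as good as the hash family's minwise bias ε, which is why low bias on small sets matters in this setting.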
Comparison of Failures and Attacks on Random and Scale-Free Networks
It has recently been observed that some statistical properties of complex networks such as the Internet, the World Wide Web or Peer-to-Peer systems have an important influence on their resilience to failures and attacks. In particular, scale-free networks (i.e. networks with a power-law degree distribution) seem much more robust than random networks in case of failures, while they are more sensitive to attacks. In this paper we deepen the study of the differences in the behavior of these two kinds of networks when facing failures or attacks. We temper the general claim that scale-free networks are much more sensitive than random networks to attacks by showing that the number of links to remove in both cases is similar, and by showing that a slightly modified scenario for failures gives results similar to the ones for attacks. We also propose and analyze an efficient attack strategy against links.
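To make the failure/attack distinction concrete, here is a minimal sketch under the common modelling assumption that a failure removes nodes uniformly at random and an attack removes the highest-degree nodes first. The abstract does not spell out its exact scenarios, so the function names, the static-degree attack order and the star-graph example are all illustrative assumptions.

```python
import random


def largest_component(adj, removed):
    """Size of the largest connected component once `removed` nodes are deleted."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # iterative depth-first search
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def after_removal(adj, count, strategy, rng=None):
    """Remove `count` nodes by 'attack' (highest initial degree) or 'failure' (random)."""
    if strategy == "attack":
        victims = sorted(adj, key=lambda u: len(adj[u]), reverse=True)[:count]
    else:
        victims = (rng or random).sample(sorted(adj), count)
    return largest_component(adj, victims)


# Star graph: hub 0 connected to leaves 1..10, an extreme case of degree heterogeneity.
star = {0: set(range(1, 11))}
star.update({leaf: {0} for leaf in range(1, 11)})
print(after_removal(star, 1, "attack"))
print(after_removal(star, 1, "failure", random.Random(1)))
```

On the star graph a single targeted removal takes out the hub and isolates every leaf, whereas a random removal most often hits a leaf and leaves the rest connected.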
Dictionary matching in a stream
We consider the problem of dictionary matching in a stream. Given a set of
strings, known as a dictionary, and a stream of characters arriving one at a
time, the task is to report each time some string in our dictionary occurs in
the stream. We present a randomised algorithm which takes O(log log(k + m))
time per arriving character and uses O(k log m) words of space, where k is the
number of strings in the dictionary and m is the length of the longest string
in the dictionary.
Counting approximately-shortest paths in directed acyclic graphs
Given a directed acyclic graph with positive edge-weights, two vertices s and
t, and a threshold weight L, we present a fully polynomial-time
approximation scheme for the problem of counting the s-t paths of length at
most L. We extend the algorithm to the case of two (or more) instances of the
same problem. That is, given two graphs that have the same vertices and edges
and differ only in edge-weights, and given two threshold-weights L_1 and L_2,
we show how to approximately count the s-t paths that have length at most L_1
in the first graph and length at most L_2 in the second graph. We believe that
our algorithms should find application in counting approximate solutions of
related optimization problems, where finding an (optimum) solution can be
reduced to the computation of a shortest path in a purpose-built auxiliary
graph.
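For reference, the counting problem itself can be solved exactly by a pseudo-polynomial dynamic program when edge weights are positive integers. The sketch below is that exact baseline, not the approximation scheme of the paper; the function name `count_paths_within`, the edge-list format and the integer-weight restriction are assumptions made for the example.

```python
from collections import defaultdict


def count_paths_within(n, edges, s, t, L):
    """Count s-t paths of total weight <= L in a DAG with positive integer weights.

    Vertices are 0..n-1; edges is a list of (u, v, weight) triples.
    Runs in O((n + m) * L) time, so it is only practical for small L.
    """
    adj = defaultdict(list)
    indeg = [0] * n
    for u, v, w in edges:
        adj[u].append((v, w))
        indeg[v] += 1

    # Topological order via Kahn's algorithm (the list grows while we scan it).
    order = [v for v in range(n) if indeg[v] == 0]
    for u in order:
        for v, _ in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)

    # counts[v][d] = number of s-v paths of total weight exactly d.
    counts = [[0] * (L + 1) for _ in range(n)]
    counts[s][0] = 1
    for u in order:
        for v, w in adj[u]:
            for d in range(L - w + 1):
                if counts[u][d]:
                    counts[v][d + w] += counts[u][d]
    return sum(counts[t])


# Three 0-3 paths with weights 2, 4 and 5.
example_edges = [(0, 1, 1), (1, 3, 1), (0, 2, 2), (2, 3, 2), (0, 3, 5)]
print(count_paths_within(4, example_edges, 0, 3, 4))
```

The table has L + 1 columns per vertex, which is exponential in the bit length of L; that blow-up is the kind of cost an approximation scheme is meant to avoid.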
Scalable Mining of Common Routes in Mobile Communication Network Traffic Data
A probabilistic method for inferring common routes from mobile communication network traffic data is presented. Besides providing mobility information, valuable in a multitude of application areas, the method has the dual purpose of enabling efficient coarse-graining as well as anonymisation by mapping individual sequences onto common routes. The approach is to represent spatial trajectories by Cell ID sequences that are grouped into routes using locality-sensitive hashing and graph clustering. The method is demonstrated to be scalable, and to accurately group sequences using an evaluation set of GPS tagged data
Cross-language high similarity search using a conceptual thesaurus
This work addresses the problem of cross-language high similarity and
near-duplicate search, where, for a given document, a highly similar one is to
be identified from a large cross-language collection of documents. We propose
a concept-based similarity model for the problem which is very light in computation
and memory. We evaluate the model on three corpora of different natures
and two language pairs, English-German and English-Spanish, using the Eurovoc
conceptual thesaurus. Our model is compared with two state-of-the-art models
and we find that, though the proposed model is very generic, it produces competitive
results and is significantly stable and consistent across the corpora.
This work was done in the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems and it has been partially
funded by the European Commission as part of the WIQ-EI IRSES project (grant no.
269180) within the FP 7 Marie Curie People Framework, and by the Text-Enterprise
2.0 research project (TIN2009-13391-C04-03). The research work of the second author
is supported by the CONACyT 192021/302009 grant.
Gupta, P.; Barrón Cedeño, LA.; Rosso, P. (2012). Cross-language high similarity search using a conceptual thesaurus. In Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics. Springer Verlag (Germany). 7488:67-75. https://doi.org/10.1007/978-3-642-33247-0_8
Analysis of Agglomerative Clustering
The diameter k-clustering problem is the problem of partitioning a finite
subset of R^d into k subsets called clusters such that the maximum
diameter of the clusters is minimized. One early clustering algorithm that
computes a hierarchy of approximate solutions to this problem (for all values
of k) is the agglomerative clustering algorithm with the complete linkage
strategy. For decades, this algorithm has been widely used by practitioners.
However, it is not well studied theoretically. In this paper, we analyze the
agglomerative complete linkage clustering algorithm. Assuming that the
dimension d is a constant, we show that for any k the solution computed by
this algorithm is an O(log k)-approximation to the diameter k-clustering
problem. Our analysis holds not only for the Euclidean distance but for any
metric that is based on a norm. Furthermore, we analyze the closely related
k-center and discrete k-center problems. For the corresponding agglomerative
algorithms, we deduce an approximation factor of O(log k) as well.
Comment: A preliminary version of this article appeared in Proceedings of the
28th International Symposium on Theoretical Aspects of Computer Science
(STACS '11), March 2011, pp. 308-319. This article also appeared in
Algorithmica. The final publication is available at
http://link.springer.com/article/10.1007/s00453-012-9717-
- …
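The agglomerative complete linkage strategy analysed in the abstract above can be sketched naively as follows: start with singleton clusters and repeatedly merge the pair whose union has the smallest diameter. This is a straightforward, unoptimised rendering for Euclidean points; the function names and the tuple point representation are assumptions made for the example, and the sketch makes no claim about the approximation guarantee.

```python
import math
from itertools import combinations


def complete_linkage(points, k):
    """Merge clusters greedily until k remain, using complete linkage."""
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Complete linkage: the largest distance between a point of a and a point of b.
        return max(math.dist(p, q) for p in a for q in b)

    while len(clusters) > k:
        # Pick the pair of clusters with the smallest complete-linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


def max_diameter(clusters):
    """Objective of diameter k-clustering: the largest intra-cluster distance."""
    return max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )


pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
result = complete_linkage(pts, 2)
print(result, max_diameter(result))
```

Stopping the merge loop at different values of k yields the hierarchy of solutions the abstract refers to, one clustering per k.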